Provable Benefits of In-Tool Learning for Large Language Models

Houliston, Sam, Odonnat, Ambroise, Arnal, Charles, Cabannes, Vivien

arXiv.org Machine Learning

Tool-augmented language models, equipped with retrieval, memory, or external APIs, are reshaping AI, yet their theoretical advantages remain underexplored. In this paper, we address this gap by demonstrating the benefits of in-tool learning (external retrieval) over in-weight learning (memorization) for factual recall. We show that the number of facts a model can memorize solely in its weights is fundamentally limited by its parameter count. In contrast, we prove that tool use enables unbounded factual recall via a simple and efficient circuit construction. These results are validated in controlled experiments, where tool-using models consistently outperform memorizing ones. We further show that for pretrained large language models, teaching tool use and general rules is more effective than finetuning facts into memory. Our work provides both a theoretical and empirical foundation, establishing why tool-augmented workflows are not just practical, but provably more scalable.
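The core contrast the abstract draws (fixed-capacity in-weight memorization vs. unbounded in-tool retrieval) can be illustrated with a toy sketch. This is not the paper's circuit construction; the hash-bucket "parameter table" and the dict-backed "tool" below are invented stand-ins for illustration only.

```python
# Toy contrast: in-weight recall with fixed capacity vs. in-tool recall.
# All names and sizes here are illustrative, not from the paper.

def make_in_weight_model(facts, capacity):
    """Memorize facts into a fixed-size 'parameter' table (hash buckets).
    Once facts outnumber buckets, collisions overwrite earlier facts."""
    table = [None] * capacity
    for key, value in facts.items():
        table[hash(key) % capacity] = (key, value)  # later facts overwrite earlier ones
    def recall(key):
        slot = table[hash(key) % capacity]
        return slot[1] if slot and slot[0] == key else None
    return recall

def make_in_tool_model(facts):
    """Defer recall to an external store: accuracy is independent of capacity."""
    store = dict(facts)  # stands in for a retrieval tool / external API
    return lambda key: store.get(key)

facts = {f"entity_{i}": f"fact_{i}" for i in range(1000)}
in_weight = make_in_weight_model(facts, capacity=100)  # under-parameterized
in_tool = make_in_tool_model(facts)

weight_acc = sum(in_weight(k) == v for k, v in facts.items()) / len(facts)
tool_acc = sum(in_tool(k) == v for k, v in facts.items()) / len(facts)
print(weight_acc, tool_acc)  # in-weight accuracy collapses; in-tool stays 1.0
```

With 1000 facts and only 100 storage slots, the in-weight model can answer at most 10% of queries, while the tool-backed model answers all of them, mirroring the parameter-count bound the abstract describes.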


Punctuation and Predicates in Language Models

Chauhan, Sonakshi, Chaudhary, Maheep, Choy, Koby, Nellessen, Samuel, Schoots, Nandi

arXiv.org Artificial Intelligence

In this paper we explore where information is collected and how it is propagated throughout layers in large language models (LLMs). We begin by examining the surprising computational importance of punctuation tokens, which previous work has identified as attention sinks and memory aids. Using intervention-based techniques, we evaluate the necessity and sufficiency (for preserving model performance) of punctuation tokens across layers in GPT-2, DeepSeek, and Gemma. Our results show stark model-specific differences: for GPT-2, punctuation is both necessary and sufficient in multiple layers, while this holds far less in DeepSeek and not at all in Gemma. Extending beyond punctuation, we ask whether LLMs process different components of input (e.g., subjects, adjectives, punctuation, full sentences) by forming early static summaries reused across the network, or whether the model remains sensitive to changes in these components across layers. We further investigate whether different reasoning rules are processed differently by LLMs. In particular, through interchange intervention and layer-swapping experiments, we find that conditional statements (if, then) and universal quantification (for all) are processed very differently. Our findings offer new insight into the internal mechanisms of punctuation usage and reasoning in LLMs and have implications for interpretability.
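The interchange-intervention idea mentioned above can be sketched in miniature: cache an intermediate activation from a "source" input, patch it into a run on a "base" input, and check whether the output changes. The two-layer toy model below is invented for illustration and is not the paper's setup.

```python
# Toy interchange intervention: patch a cached intermediate value from a
# source input into a base run and compare outputs.

def layer1(x):
    # Stands in for an early "summary" computed over the input.
    return sum(x)

def layer2(h, x):
    # Stands in for later computation that consumes the summary h.
    return h * 2 + x[-1]

def run(x, patch_h=None):
    h = layer1(x) if patch_h is None else patch_h  # intervention point
    return layer2(h, x)

base, source = [1, 2, 3], [10, 20, 30]
plain = run(base)                            # layer1 = 6  -> 6*2 + 3 = 15
patched = run(base, patch_h=layer1(source))  # layer1 = 60 -> 60*2 + 3 = 123
print(plain, patched)
```

If the patched output differs from the plain one, the intervened component causally matters at that layer; if the model had formed a static early summary and ignored it downstream, patching would leave the output unchanged.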


Limitation Learning: Catching Adverse Dialog with GAIL

Kasmanoff, Noah, Zalkikar, Rahul

arXiv.org Artificial Intelligence

Imitation learning is a proven method for creating a policy in the absence of rewards by leveraging expert demonstrations. In this work, we apply imitation learning to conversation. In doing so, we recover a policy capable of talking to a user given a prompt (input state), and a discriminator capable of classifying between expert and synthetic conversation. While our policy is effective, we recover results from our discriminator that indicate the limitations of dialog models. We argue that this technique can be used to identify adverse behavior in arbitrary models used for dialog-oriented tasks.
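The GAIL-style setup described above (a discriminator telling expert replies from synthetic ones, whose score doubles as the policy's reward) can be sketched with a toy logistic-regression discriminator. The data, hashed bag-of-words features, and hyperparameters below are toy stand-ins, not the paper's models.

```python
# Minimal GAIL-flavored sketch for dialog: train a discriminator on
# expert vs. synthetic utterances; its confidence becomes a reward signal.
import math
import random

random.seed(0)

def features(text, dim=64):
    # Hashed bag-of-words features (toy stand-in for an utterance encoder).
    v = [0.0] * dim
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v

def train_discriminator(expert, synthetic, dim=64, lr=0.5, epochs=200):
    # Plain logistic regression via per-example gradient steps.
    w = [0.0] * dim
    data = [(features(t, dim), 1.0) for t in expert] + \
           [(features(t, dim), 0.0) for t in synthetic]
    for _ in range(epochs):
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            g = y - p
            w = [wi + lr * g * xi for wi, xi in zip(w, x)]
    return w

def d_score(w, text):
    z = sum(wi * xi for wi, xi in zip(w, features(text, len(w))))
    return 1.0 / (1.0 + math.exp(-z))

expert = ["thanks that makes sense", "could you clarify the second step"]
synthetic = ["yes yes yes yes", "i do not know i do not know"]
w = train_discriminator(expert, synthetic)

# GAIL-style reward for a policy sample: higher when it fools the discriminator.
reward = math.log(max(d_score(w, "could you clarify that"), 1e-8))
```

The abstract's diagnostic use is the flip side of this loop: utterances the discriminator confidently flags as synthetic point at systematic failure modes of the dialog model.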


Generating Pedagogically Meaningful Visuals for Math Word Problems: A New Benchmark and Analysis of Text-to-Image Models

Wang, Junling, Rutkiewicz, Anna, Wang, April Yi, Sachan, Mrinmaya

arXiv.org Artificial Intelligence

Visuals are valuable tools for teaching math word problems (MWPs), helping young learners translate textual descriptions into mathematical expressions before solving them. However, creating such visuals is labor-intensive, and there is a lack of automated methods to support this process. In this paper, we present Math2Visual, an automatic framework for generating pedagogically meaningful visuals from MWP text descriptions. Math2Visual leverages a pre-defined visual language and a design space grounded in interviews with math teachers to illustrate the core mathematical relationships in MWPs. Using Math2Visual, we construct an annotated dataset of 1,903 visuals and evaluate Text-to-Image (TTI) models for their ability to generate visuals that align with our design. We further fine-tune several TTI models on our dataset, demonstrating improvements in educational visual generation. Our work establishes a new benchmark for automated generation of pedagogically meaningful visuals and offers insights into key challenges in producing multimodal educational content, such as the misrepresentation of mathematical relationships and the omission of essential visual elements.


'Jeopardy' host Ken Jennings 'deeply skeptical' of AI, years after losing to supercomputer

FOX News

"Jeopardy!" host Ken Jennings tells Fox News Digital he wants to know a human is behind any creative projects, not AI. "I'm deeply skeptical of AI," Jennings told Fox News Digital at the TCM Classic Film Festival. "Obviously, these current iterations of LLMs [Large Language Models] would clean Watson's clock at 'Jeopardy!' The technology has moved on. I've played with chatbots and 'Jeopardy!' clues, and they're very hard to stump," he said.


The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles

Toh, Vernon Y. H., Chia, Yew Ken, Ghosal, Deepanway, Poria, Soujanya

arXiv.org Artificial Intelligence

In our evaluation, we assess the performance of GPT-[n] and o-[n] models on abstract multimodal puzzles from PuzzleVQA, which primarily test abstract reasoning. Additionally, we evaluate the models on AlgoPuzzleVQA, which requires an algorithmic approach rather than brute-force solving. To ensure a comprehensive evaluation, we present the puzzles in both multiple-choice and open-ended question answering formats. Our findings indicate that despite their sophisticated capabilities on standard benchmarks, current models still struggle with seemingly simple multimodal puzzles (Figure 3). Contrary to previous benchmarks such as ARC-AGI, we observe a less dramatic reasoning curve without extreme jumps in performance. This limitation highlights the substantial gap between current artificial intelligence and human-like reasoning abilities. As models continue to rapidly advance and scale, as in Figure 1, this benchmark will serve as a critical indicator of progress toward more robust and generalized artificial intelligence. Overall, the key finding of our study is that performance steadily improves from GPT-4-Turbo to GPT-4o to o1: while the jump from GPT-4-Turbo to GPT-4o is moderate, the transition from GPT-4o to o1 marks a significant advancement, but at roughly 750x the inference cost.


Spontaneous Informal Speech Dataset for Punctuation Restoration

Liu, Xing Yi, Beigi, Homayoon

arXiv.org Artificial Intelligence

Presently, punctuation restoration models are evaluated almost solely on well-structured, scripted corpora. On the other hand, real-world ASR systems and post-processing pipelines are typically applied to spontaneous speech with significant irregularities, stutters, and deviations from perfect grammar. To address this discrepancy, we introduce SponSpeech, a punctuation restoration dataset derived from informal speech sources, which includes punctuation and casing information. In addition to publicly releasing the dataset, we contribute a filtering pipeline that can be used to generate more data. Our filtering pipeline examines the quality of both speech audio and transcription text. We also carefully construct a "challenging" test set, aimed at evaluating models' ability to leverage audio information to predict otherwise grammatically ambiguous punctuation. SponSpeech is available at https://github.com/GitHubAccountAnonymous/PR, along with all code for dataset building and model runs.


NUMCoT: Numerals and Units of Measurement in Chain-of-Thought Reasoning using Large Language Models

Xu, Ancheng, Tan, Minghuan, Wang, Lei, Yang, Min, Xu, Ruifeng

arXiv.org Artificial Intelligence

Numeral systems and units of measurement are two conjoined topics in human activities and have mutual effects with the languages expressing them. Currently, the evaluation of Large Language Models (LLMs) often involves mathematical reasoning, yet little attention is given to how minor changes in numbers or units can drastically alter the complexity of problems and the performance of LLMs. In this paper, we scrutinize existing LLMs on the processing of numerals and units of measurement by constructing datasets with perturbations. We first anatomize the reasoning in math word problems into different sub-procedures, such as numeral conversions from language to numbers and measurement conversions based on units. Then we further annotate math word problems from ancient Chinese arithmetic works that are challenging in terms of numerals and units of measurement. Experiments on perturbed datasets demonstrate that LLMs still encounter difficulties in handling numeral and measurement conversions.


Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases

Zhou, Giulio, Lam, Tsz Kin, Birch, Alexandra, Haddow, Barry

arXiv.org Artificial Intelligence

Speech-to-Text Translation (S2TT) has typically been addressed with cascade systems, where speech recognition systems generate a transcription that is subsequently passed to a translation model. While there has been growing interest in developing direct speech translation systems to avoid propagating errors and losing non-verbal content, prior work in direct S2TT has struggled to conclusively establish the advantages of integrating the acoustic signal directly into the translation process. This work proposes using contrastive evaluation to quantitatively measure the ability of direct S2TT systems to disambiguate utterances where prosody plays a crucial role. Specifically, we evaluated Korean-English translation systems on a test set containing wh-phrases, for which prosodic features are necessary to produce translations with the correct intent, whether that intent is a statement, a yes/no question, a wh-question, or something else. Our results clearly demonstrate the value of direct translation systems over cascade translation models, with a notable 12.9% improvement in overall accuracy in ambiguous cases, along with up to a 15.6% increase in F1 scores for one of the major intent categories. To the best of our knowledge, this work stands as the first to provide quantitative evidence that direct S2TT models can effectively leverage prosody. The code for our evaluation is openly accessible and freely available for review and utilisation.
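Contrastive evaluation, as used above, credits a system when it scores the reference translation higher than a minimally different alternative with the wrong intent. A generic sketch of that protocol follows; the toy scorer, cue tokens, and romanized examples are invented for illustration and do not reflect the paper's models or data.

```python
# Generic contrastive-evaluation harness: for each example, the system
# must prefer the reference over a minimally different contrastive
# hypothesis (e.g. wh-question intent vs. statement intent).

def contrastive_accuracy(examples, score):
    correct = sum(score(src, ref) > score(src, contrast)
                  for src, ref, contrast in examples)
    return correct / len(examples)

def toy_score(src, hypothesis):
    # Toy stand-in for a model's log-probability: rewards a punctuation
    # cue that matches a (made-up) prosody marker on the source.
    cue = "?" if src.endswith("RISE") else "."
    return hypothesis.count(cue)

examples = [
    ("mweol meokeo RISE", "What are you eating?", "You are eating something."),
    ("mweol meokeo FLAT", "You are eating something.", "What are you eating?"),
]
acc = contrastive_accuracy(examples, toy_score)
print(acc)  # 1.0: the toy scorer prefers the reference on both items
```

In the real setting, `score` would be the translation model's likelihood of each hypothesis given the audio, so a system that genuinely uses prosody outscores one that only sees the (intent-ambiguous) transcript.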


Dialogue Systems Can Generate Appropriate Responses without the Use of Question Marks? -- Investigation of the Effects of Question Marks on Dialogue Systems

Mizumoto, Tomoya, Yamazaki, Takato, Yoshikawa, Katsumasa, Ohagi, Masaya, Kawamoto, Toshiki, Sato, Toshinori

arXiv.org Artificial Intelligence

When individuals engage in spoken discourse, various phenomena can be observed that differ from those apparent in text-based conversation. While written communication commonly uses a question mark to denote a query, in spoken discourse, queries are frequently indicated by a rising intonation at the end of a sentence. However, numerous speech recognition engines do not append a question mark to recognized queries, presenting a challenge when creating a spoken dialogue system. Specifically, the absence of a question mark at the end of a sentence can impede the generation of appropriate responses to queries in spoken dialogue systems. Hence, we investigate the impact of question marks on dialogue systems, with the results showing that they have a significant impact. Moreover, we analyze specific examples in an effort to determine which types of utterances have the greatest impact on dialogue systems.